Abstract —Attack detection problems in the smart grid are
posed as statistical learning problems for different attack scenarios
in which the measurements are observed in batch or
online settings. In this approach, machine learning algorithms 
are used to classify measurements as being either secure or 
attacked. An attack detection framework is provided to exploit 
any available prior knowledge about the system and surmount 
constraints arising from the sparse structure of the problem in 
the proposed approach. Well-known batch and online learning 
algorithms (supervised and semi-supervised) are employed with 
decision and feature level fusion to model the attack detection 
problem. The relationships between statistical and geometric 
properties of attack vectors employed in the attack scenarios and 
learning algorithms are analyzed to detect unobservable attacks 
using statistical learning methods. The proposed algorithms are 
examined on various IEEE test systems. Experimental analyses 
show that machine learning algorithms can detect attacks with 
performances higher than the attack detection algorithms which 
employ state vector estimation methods in the proposed attack 
detection framework. 

Index Terms —Smart grid security, sparse optimization, classification,
attack detection, phase transition.

I. Introduction 

Machine learning methods have been widely proposed
in the smart grid literature for monitoring and control of
power systems. Rudin et al. suggest an
intelligent framework for system design in which machine
learning algorithms are employed to predict the failures of
system components. Anderson et al. employ machine
learning algorithms for the energy management of loads and
sources in smart grid networks. Malicious activity prediction
and intrusion detection problems have been analyzed using
machine learning techniques at the network layer of smart
grid communication systems.

In this paper, we focus on the false data injection attack 
detection problem in the smart grid at the physical layer. 
We use the Distributed Sparse Attacks model proposed by
Ozay et al., where the attacks are directed by injecting
false data into the local measurements observed by either
local network operators or smart Phasor Measurement Units
(PMUs) in a network with a hierarchical structure, i.e. the
measurements are grouped into clusters. In addition, network
operators who employ statistical learning algorithms for attack
detection know the topology of the network, the measurements
observed in the clusters and the measurement matrix.

In attack detection methods that employ state vector estimation,
the state of the system is first estimated from
the observed measurements. Then, the residual between the
observed and the estimated measurements is computed. If the
residual is greater than a given threshold, a data injection
attack is declared. However, exact recovery of
state vectors is a challenge for state vector estimation based
methods in sparse networks, where the Jacobian
measurement matrix is sparse. Sparse reconstruction methods
can be employed to solve the problem, but the performance
of this approach is limited by the sparsity of the state vectors.
In addition, if false data injected vectors reside
in the column space of the Jacobian measurement matrix and
satisfy some sparsity conditions (e.g., the number of nonzero
elements is at most $\kappa^*$, which is bounded by the size of
the Jacobian matrix), then false data injection attacks, called
unobservable attacks, cannot be detected.

The contributions of this paper are as follows: 

1) We conduct a detailed analysis of the techniques proposed
by Ozay et al., who employ supervised learning
algorithms to predict false data injection attacks.
In addition, we discuss the validity of the fundamental 
assumptions of statistical learning theory in the smart 
grid. Then, we propose semi-supervised, online learning, 
decision and feature level fusion algorithms in a generic 
attack construction framework, which can be employed 
in hierarchical and topological networks for different 
attack scenarios. 

2) We analyze the geometric structure of the measurement 
space defined by measurement vectors, and the effect 
of false data injection attacks on the distance function 
of the vectors. This leads to algorithms for learning 
the distance functions, detecting unobservable attacks, 
estimating the attack strategies and predicting future 
attacks using a set of observations. 

3) We empirically show that the statistical learning algorithms
are capable of detecting both observable and
unobservable attacks with performance better than the
attack detection algorithms that employ state vector
estimation methods. In addition, phase transitions can be
observed in the performance of Support Vector Machines
(SVM) at a value of $\kappa^*$.

In the next section, the attack detection problem is formulated
as a statistical classification problem in a network according
to the model proposed by Ozay et al. In Section II, we
establish the relationship between statistical learning methods
and attack detection problems in the smart grid. Supervised,
semi-supervised, decision and feature level fusion, and online
learning algorithms are used to solve the classification problem
in Section III. In Section IV, our approach is numerically
evaluated on IEEE test systems. A summary of the results
and discussion on future work are given in Section V.

II. Problem Formulation 

In this section, the attack detection problem is formalized 
as a machine learning problem. 

A. False Data Injection Attacks 

False Data Injection Attacks are defined in the following 
model: 

$$z = Hx + n, \tag{1}$$

where $x \in \mathbb{R}^D$ contains the voltage phase angles at the buses,
$z \in \mathbb{R}^N$ is the vector of measurements, $H \in \mathbb{R}^{N \times D}$ is the
measurement Jacobian matrix and $n \in \mathbb{R}^N$ is the measurement
noise, which is assumed to have independent components.
The attack detection problem is defined as that of deciding
whether or not there is an attack on the measurements. If
the noise is distributed normally with zero mean, then a
State Vector Estimation (SVE) method can be employed by
computing

$$\hat{x} = (H^T \Lambda H)^{-1} H^T \Lambda z, \tag{2}$$

where $\Lambda$ is a diagonal matrix whose diagonal elements
are given by $\Lambda_{ii} = \nu_i^{-2}$, and $\nu_i^2$ is the variance of $n_i$,
$\forall i = 1, 2, \ldots, N$. The goal of the attacker is to inject a
false data vector $a \in \mathbb{R}^N$ into the measurements without being
detected by the operator. The resulting observation model is

$$z = Hx + a + n. \tag{3}$$

The false data injection vector, $a$, is a nonzero vector, such
that $a_i \neq 0$, $\forall i \in \mathcal{A}$, where $\mathcal{A}$ is the set of indices of
the measurement variables that will be attacked. The secure
variables satisfy the constraint $a_i = 0$, $\forall i \in \bar{\mathcal{A}}$, where $\bar{\mathcal{A}}$ is the
set complement of $\mathcal{A}$.

In order to detect an attack, the measurement residual
is examined in $\ell_2$-norm, $\rho = \|z - H\hat{x}\|_2^2$, where $\hat{x} \in \mathbb{R}^D$
is the state vector estimate. If $\rho > \tau$, where $\tau \in \mathbb{R}$ is an
arbitrary threshold which determines the trade-off between
the detection and false alarm probabilities, then the network
operator declares that the measurements are attacked.

One of the challenging problems of this approach is that
the Jacobian measurement matrices of power systems in the
smart grid are sparse under the DC power flow model [13].
Therefore, the sparsity of the systems determines the
performance of sparse state vector estimation methods.
In addition, unobservable attacks can be constructed even
if the network operator can estimate the state vector correctly.
For instance, if $a = Hc$, where $c \in \mathbb{R}^D$ is an attack vector, then
the state estimate is shifted by $c$ while the residual is unchanged, so
the attack is unobservable by using the measurement residual
$\rho$. In this work, we show that statistical learning
methods can be used to detect the unobservable attacks with
performance higher than the attack detection algorithms that
employ a state vector estimation approach. Following the
motivation mentioned above, a new approach is proposed
using statistical learning methods.
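As an illustration of the residual test and of its blind spot, the following minimal sketch evaluates the estimator in (2) and the residual $\rho$ on a toy system. The matrix `H`, the weights `lam`, the measurement values and the helper names `solve` and `sve_residual` are illustrative assumptions, not values or code from the experiments.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def sve_residual(H, z, lam):
    """Return (x_hat, rho): the estimate of Eq. (2) and rho = ||z - H x_hat||_2^2."""
    N, D = len(H), len(H[0])
    # Normal equations (H^T Lambda H) x = H^T Lambda z with Lambda = diag(lam).
    A = [[sum(lam[k] * H[k][i] * H[k][j] for k in range(N)) for j in range(D)]
         for i in range(D)]
    b = [sum(lam[k] * H[k][i] * z[k] for k in range(N)) for i in range(D)]
    x_hat = solve(A, b)
    rho = sum((z[k] - sum(H[k][j] * x_hat[j] for j in range(D))) ** 2
              for k in range(N))
    return x_hat, rho


# Toy system (illustrative values only): N = 4 measurements, D = 2 states.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0], [1.0, 1.0]]
lam = [1.0, 1.0, 1.0, 1.0]             # unit noise variances
z = [0.52, 0.29, 0.21, 0.83]           # noisy secure measurements

c = [0.3, -0.2]                        # attacker's perturbation of the state
a_unobs = [sum(H[k][j] * c[j] for j in range(2)) for k in range(4)]  # a = Hc
a_obs = [0.0, 0.0, 0.5, 0.0]           # sparse attack outside the column space

_, rho_secure = sve_residual(H, z, lam)
_, rho_unobs = sve_residual(H, [zk + ak for zk, ak in zip(z, a_unobs)], lam)
_, rho_obs = sve_residual(H, [zk + ak for zk, ak in zip(z, a_obs)], lam)
```

In this toy setting `rho_unobs` coincides with `rho_secure` up to rounding, whereas `rho_obs` is larger, so a threshold on $\rho$ can flag only the second attack.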


B. Attack Detection using Statistical Learning Methods 

Given a set of samples $\mathcal{S} = \{s_i\}_{i=1}^{M}$ and a set of labels
$\mathcal{Y} = \{y_i\}_{i=1}^{M}$, where $(s_i, y_i) \in \mathcal{S} \times \mathcal{Y}$ are independent and
identically distributed (i.i.d.) with joint distribution $P$, the
statistical learning problem can be defined as constructing a
hypothesis function $f : \mathcal{S} \to \mathcal{Y}$ that captures the relationship
between the samples and labels. Then, the attack detection
problem is defined as a binary classification problem, where

$$y_i = \begin{cases} 1, & \text{if } a_i \neq 0 \\ -1, & \text{if } a_i = 0 \end{cases} \tag{4}$$

In other words, $y_i = 1$ if the $i$-th measurement is attacked,
and $y_i = -1$ when there is no attack.

In this paper, the model proposed by Ozay et al. is
employed for attack construction where the measurements are
observed in clusters in the network. Measurement matrices,
and observation and attack vectors are partitioned into $G$
blocks, denoted by $\mathcal{G}_g$ with $|\mathcal{G}_g| = N_g$ for $g = 1, 2, \ldots, G$.
Therefore, the observation model is defined as

$$\begin{bmatrix} z_1 \\ \vdots \\ z_G \end{bmatrix} = \begin{bmatrix} H_1 \\ \vdots \\ H_G \end{bmatrix} x + \begin{bmatrix} a_1 \\ \vdots \\ a_G \end{bmatrix} + \begin{bmatrix} n_1 \\ \vdots \\ n_G \end{bmatrix}, \tag{5}$$

where $z_g \in \mathbb{R}^{N_g}$ is the measurement observed in the $g$-th
cluster of nodes through measurement matrix $H_g \in \mathbb{R}^{N_g \times D}$
and noise $n_g \in \mathbb{R}^{N_g}$, and which is under attack $a_g \in \mathbb{R}^{N_g}$
with $g = 1, 2, \ldots, G$. Within this framework, each observed
measurement vector is considered as a sample, i.e., $s_i = z_g$,
where $z_g \in \mathcal{S}$.$^1$ Taking this into account, the measurements
are classified in two groups, secure and attacked, by computing
$f(s_i)$, $\forall i = 1, 2, \ldots, M$.

The crucial part of the traditional attack detection algorithm,
which we call State Vector Estimation (SVE), is the estimation
of $x$. If the attack vectors, $a$, are constructed in the column
space of $H$, then they are annihilated in the computation of
the residual. Therefore, SVE cannot detect the attacks and
these attacks are called unobservable. On the other hand, we
observe that the distance between the attacked and the secure
measurement vectors is defined by the attack vector in $\mathcal{S}$. If
the attacks are unobservable, i.e. $a_i = Hc_i$ and $a_j = Hc_j$,
where $c_i \in \mathbb{R}^D$ and $c_j \in \mathbb{R}^D$ are the attack vectors, then the
distance between $\tilde{z}_i = z_i + a_i$ and $\tilde{z}_j = z_j + a_j$ is computed as

$$\|\tilde{z}_i - \tilde{z}_j\|_2 = \begin{cases} \|z_i - z_j\|_2 + \|a_i - a_j\|_2, & \text{if } i, j \in \mathcal{A}, \\ \|z_i - z_j\|_2 + \|a_i\|_2, & \text{if } i \in \mathcal{A},\ j \in \bar{\mathcal{A}}, \\ \|z_i - z_j\|_2, & \text{if } i, j \in \bar{\mathcal{A}}, \end{cases} \tag{6}$$

where $z_i \in \mathcal{S}$ and $z_j \in \mathcal{S}$. In (6), we can extract information
on the attack vectors by observing the measurements. Since
the distances between secure and attacked measurements are
discriminated by the attack vectors, the attacks can be recognized
by the learning algorithms which use the information of
these distances, even if the attacks are unobservable.

$^1$ For simplicity of notation, we use $i$ as the index of measurements $z_i$, $\tilde{z}_i$,
and attack vectors $a_i$, $\forall i = 1, 2, \ldots, M$.

Two main assumptions from statistical learning theory need
to be taken into account to classify measurements which
satisfy (6):

1) We assume that $(s_i, y_i) \in \mathcal{S} \times \mathcal{Y}$ are distributed according
to a joint distribution $P$. In a smart grid setting, this
distribution assumption is satisfied for the attack models
in which the measurements $z$ are functions of $a$, and we
can extract statistical information about both the attacked
and secure measurements from the observations.

2) We assume that $(s_i, y_i)$, $\forall i = 1, 2, \ldots, M$, are sampled from $P$,
independently and identically. This assumption is also
satisfied in the smart grid if the entries of $n$ and $a$ are
i.i.d. random variables [16].

In order to explain the significance of the above assumptions
in the smart grid, we consider the following example. Assume
that measurements $1, 2 \in \mathcal{A}$ and $3, 4 \in \bar{\mathcal{A}}$ are given such that
$y_1, y_2 = 1$ and $y_3, y_4 = -1$. Furthermore, assume that $z_1 = 3 \cdot \mathbf{1}$,
$z_2 = 5 \cdot \mathbf{1}$, $z_3 = 2 \cdot \mathbf{1}$ and $z_4 = 4 \cdot \mathbf{1}$, where $\mathbf{1} = (1, 1)^T$. If the
attack vectors are identical but not independent, then the attack
vectors can be constructed as $a_1 = a_2 = -1 \cdot \mathbf{1}$. As a result, we
observe that $\tilde{z}_1 = z_3 = 2 \cdot \mathbf{1}$ and $\tilde{z}_2 = z_4 = 4 \cdot \mathbf{1}$. Therefore,
our assumption about the existence of a joint distribution $P$
is not satisfied and we cannot classify the measurements with
the aforementioned approach.
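The example can be checked numerically. The short sketch below uses the values given in the text; the helper names `scale` and `add` and the grouping step are illustrative assumptions added only to make the ambiguity explicit.

```python
# `ones` stands for the vector (1, 1)^T used in the example.
ones = (1.0, 1.0)


def scale(alpha, v):
    return tuple(alpha * vi for vi in v)


def add(u, v):
    return tuple(ui + vi for ui, vi in zip(u, v))


z1, z2 = scale(3, ones), scale(5, ones)   # measurements that will be attacked
z3, z4 = scale(2, ones), scale(4, ones)   # secure measurements
a = scale(-1, ones)                       # the same attack vector for both

z1_attacked, z2_attacked = add(z1, a), add(z2, a)
samples = [(z1_attacked, 1), (z2_attacked, 1), (z3, -1), (z4, -1)]

# Group the labels by observed vector. A hypothesis f assigns one label per
# vector, so a vector that carries both labels cannot be classified correctly.
labels_by_vector = {}
for s, y in samples:
    labels_by_vector.setdefault(s, set()).add(y)
ambiguous = [s for s, ys in labels_by_vector.items() if len(ys) > 1]
```

Both observed vectors, $2 \cdot \mathbf{1}$ and $4 \cdot \mathbf{1}$, end up in `ambiguous`, which is the situation described above.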

III. Attack Detection using Machine Learning 
Methods 

In this section, the attack detection problem is modeled 
by statistical classification of measurements using machine 
learning methods. 


A. Supervised Learning Methods 

In the following, the classification function $f$ is computed in
a supervised learning framework by a network operator using
a set of training data $Tr = \{(s_i, y_i)\}_{i=1}^{M_{Tr}}$. The class label, $y'_i$,
of a new observation, $s'_i$, is predicted using $y'_i = f(s'_i)$. We
employ four learning algorithms for attack detection.

1) Perceptron: Given a sample $s_i$, a perceptron predicts $y_i$
using the classification function $f(s_i) = \mathrm{sign}(w \cdot s_i)$, where
$w \in \mathbb{R}^{N_i}$ is a weight vector and $\mathrm{sign}(w \cdot s_i)$ is defined as

$$\mathrm{sign}(w \cdot s_i) = \begin{cases} -1, & \text{if } w \cdot s_i < 0 \\ 1, & \text{otherwise.} \end{cases} \tag{7}$$

In the training phase, the weights are adjusted at each
iteration $t = 1, 2, \ldots, T$ of the algorithm for each training
sample using

$$w(t + 1) := w(t) + \Delta w, \tag{8}$$

where $\Delta w = \gamma (y_i - f(s_i)) s_i$ and $\gamma$ is the learning rate. The
algorithm is iterated until a stopping criterion, such as the
number of algorithm steps, or an error threshold, is achieved.
In the testing phase, the label of a new test sample is predicted
by $f(s'_i) = \mathrm{sign}(w(T) \cdot s'_i)$.

Despite its success in various machine learning applications,
the convergence of the algorithm is assured only when the
samples are linearly separable [13]. For that reason, the
perceptron can be successfully used for the detection of the
attacks only if the measurements can be separated by a
hyperplane. In the following sections, we give examples of
classification algorithms which overcome this limitation by
employing non-linear classification rules or feature extraction
methods.

2) k-Nearest Neighbor (k-NN): This algorithm labels an
unlabeled sample $s'_i$ according to the labels of its $k$ nearest
neighbors in the feature space [17]. Specifically, the observed
measurements $s_i \in \mathcal{S}$, $\forall i = 1, 2, \ldots, M$, are taken
as feature vectors. The set of $k$-nearest neighbors of $s'_i$,
$K(s'_i) = \{s_{i(1)}, s_{i(2)}, \ldots, s_{i(k)}\}$, is constructed by computing
the Euclidean distances between the samples [18], where
$i(1), i(2), \ldots, i(M)$ are defined as

$$\|s'_i - s_{i(1)}\|_2 \leq \|s'_i - s_{i(2)}\|_2 \leq \cdots \leq \|s'_i - s_{i(M)}\|_2. \tag{9}$$

Then, the most frequently observed class label is computed
using majority voting among the class labels of the samples
in the neighborhood, and assigned as the class label of $s'_i$.
One of the challenges of k-NN is the curse of dimensionality,
which is the difficulty of the learning problem when the
sample size is small compared to the dimension of the feature
vector. In attack detection, this problem can
be handled using the following approaches:

• Feature selection algorithms can be used to reduce the
dimension of the feature vectors. Development
of feature selection algorithms may be a promising direction
for smart grid security, and is an interesting topic
for future work.

• Kernel machines, such as SVMs, can be used to map the
feature vectors in $\mathcal{S}$ to Hilbert spaces, where the feature
vectors are processed implicitly in the mappings and the
computation of the learning models. We give the details of
the kernel machines and SVMs in the following sections.

• The samples can be processed in small sizes, e.g. by
selecting a single measurement vector as a sample,
which leads to one-dimensional samples. We employ
this approach in Section IV. If the sample size is large,
distributed learning and optimization methods can be used.

3) Support Vector Machines: We seek a hyperplane that
linearly separates attacked and secure measurements into two
half spaces in a feature
space, $\mathcal{F}$, which is constructed by a non-linear mapping
$\Psi : \mathcal{S} \to \mathcal{F}$ [13]. A hyperplane is represented by a weight
vector $w_\Psi \in \mathcal{F}$ and a bias variable $b \in \mathbb{R}$, which results in

$$w_\Psi \cdot \Psi(s) + b = 0, \tag{10}$$

where $\Psi(s)$ is the feature vector of the sample that lies on the
hyperplane in $\mathcal{F}$, as shown in Fig. 1. We choose the hyperplane
that is at the largest distance from the closest positive and
negative samples. This constraint can be formulated as

$$y_i (w_\Psi \cdot \Psi(s_i) + b) - 1 \geq 0, \quad \forall i = 1, 2, \ldots, M_{Tr}. \tag{11}$$

Since $d_+ = d_- = \frac{1}{\|w_\Psi\|_2}$, where $d_+$ and $d_-$ are the shortest
distances from the hyperplane to the closest positive and
negative samples respectively, a maximum margin hyperplane
can be computed by minimizing $\|w_\Psi\|_2^2$.

Fig. 1: Classification using SVM. (a) Attack detection using the
linearly separable dataset. (b) Attack detection using the linearly
non-separable dataset. Positive and negative samples which belong to
the class of attacked and secure measurements are depicted by disk
and star markers, respectively. Support vectors and misclassified
samples are depicted by dashed circles and hexagonal markers,
respectively.

If the training examples in the transformed space are not
linearly separable (see Fig. 1.b), then the optimization problem
can be modified by introducing slack variables $\xi_i \geq 0$,
$\forall i = 1, 2, \ldots, M_{Tr}$, in (11), which yields

$$y_i (w_\Psi \cdot \Psi(s_i) + b) - 1 + \xi_i \geq 0, \quad \forall i = 1, 2, \ldots, M_{Tr}. \tag{12}$$

The hyperplane is computed by solving the following
optimization problem in primal or dual form [22], [23]

$$\begin{aligned} \text{minimize} \quad & \|w_\Psi\|_2^2 + C \sum_{i=1}^{M_{Tr}} \xi_i \\ \text{subject to} \quad & y_i (w_\Psi \cdot \Psi(s_i) + b) - 1 + \xi_i \geq 0, \\ & \xi_i \geq 0, \quad \forall i = 1, 2, \ldots, M_{Tr}, \end{aligned} \tag{13}$$

where $C$ is a constant that penalizes (an upper bound on) the
training error of the soft margin SVM.
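To make the role of the slack variables concrete, the sketch below evaluates the objective of (13) for a fixed hyperplane. It assumes the identity feature map $\Psi(s) = s$ and hand-picked values for $w$, $b$ and $C$; it does not solve the optimization problem, and the function name `soft_margin_objective` is illustrative.

```python
def soft_margin_objective(w, b, samples, C):
    """Return (objective, slacks) of problem (13) for a fixed (w, b).

    For each sample the smallest feasible slack is used:
    xi_i = max(0, 1 - y_i (w . s_i + b)).
    """
    slacks = []
    for s, y in samples:
        margin = y * (sum(wj * sj for wj, sj in zip(w, s)) + b)
        slacks.append(max(0.0, 1.0 - margin))
    objective = sum(wj * wj for wj in w) + C * sum(slacks)
    return objective, slacks


# Toy samples (illustrative): label +1 = attacked, -1 = secure.
samples = [((2.0, 0.0), 1), ((0.5, 0.0), 1), ((-2.0, 0.0), -1), ((0.5, 1.0), -1)]
objective, slacks = soft_margin_objective(w=(1.0, 0.0), b=0.0, samples=samples, C=10.0)
```

Here the second sample lies inside the margin ($0 < \xi_i < 1$) and the fourth is misclassified ($\xi_i > 1$), so both contribute to the penalty term weighted by $C$.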

4) Sparse Logistic Regression: In utilizing this approach for
attack detection, we solve the classification problem using the
Alternating Direction Method of Multipliers (ADMM) [24],
considering the sparse state vector estimation approach of
Ozay et al. Note that the hyperplanes defined in (10) can
be computed by employing the generalized logistic regression
models, which provide the distributions

$$P(y_i \mid s_i) = \frac{1}{1 + \exp(-y_i (w \cdot s_i + b))}, \tag{14}$$

$$P(y_i \mid \Psi(s_i)) = \frac{1}{1 + \exp(-y_i (w_\Psi \cdot \Psi(s_i) + b))}, \tag{15}$$

in $\mathcal{S}$ and $\mathcal{F}$, respectively. For this purpose, we minimize the
logistic loss functions

$$\mathcal{L}(s_i, y_i) = \log(1 + \exp(-y_i (w \cdot s_i + b))), \tag{16}$$

$$\mathcal{L}(\Psi(s_i), y_i) = \log(1 + \exp(-y_i (w_\Psi \cdot \Psi(s_i) + b))). \tag{17}$$

Defining a feature matrix $S = (s_1^T, s_2^T, \ldots, s_{M_{Tr}}^T)^T$ and a label
vector $Y = (y_1, y_2, \ldots, y_{M_{Tr}})^T$, the ADMM optimization
problem [24] is constructed as

$$\begin{aligned} \text{minimize} \quad & \mathcal{L}(S, Y) + \mu(r) \\ \text{subject to} \quad & w - r = 0, \end{aligned} \tag{18}$$

where $w$ is a weight vector, $r$ is a vector of optimization
variables, $\mu(r) = \lambda \|r\|_1$ is a regularization function, and $\lambda$ is
a regularization parameter which is introduced to control the
sparsity of the solution [24].
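Two building blocks of this formulation can be sketched briefly: the logistic loss of (16) summed over a training set, and the soft-thresholding operator. In the standard ADMM splitting for $\ell_1$-regularized problems, the update of $r$ reduces to soft-thresholding with threshold $\lambda / \rho$, where $\rho$ denotes the ADMM penalty parameter. The sketch below is not a full ADMM solver, and the function names are illustrative assumptions.

```python
import math


def logistic_loss(w, b, samples):
    """Sum of log(1 + exp(-y_i (w . s_i + b))) over the samples, cf. Eq. (16)."""
    total = 0.0
    for s, y in samples:
        t = y * (sum(wj * sj for wj, sj in zip(w, s)) + b)
        total += math.log1p(math.exp(-t))
    return total


def soft_threshold(v, kappa):
    """Element-wise proximal operator of kappa * ||.||_1.

    Each entry is shrunk toward zero by kappa; entries with magnitude
    below kappa become zero, which is the source of sparsity.
    """
    return [math.copysign(max(abs(vi) - kappa, 0.0), vi) for vi in v]


# Toy usage (illustrative values).
samples = [((1.0, 0.0), 1), ((-1.0, 0.5), -1)]
loss_at_zero = logistic_loss([0.0, 0.0], 0.0, samples)
shrunk = soft_threshold([0.9, -0.05, -0.4], 0.1)
```

With a zero weight vector every sample contributes $\log 2$ to the loss, and the small middle entry of the toy vector is set to zero by the thresholding step.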


B. Semi-supervised Learning Methods 

In semi-supervised learning methods, the information obtained
from the unlabeled test samples is used during the
computation of the learning models [25].

In this section, a semi-supervised Support Vector Machine
algorithm, called Semi-supervised SVM (S3VM) [26], [27],
is employed to establish the analytical relationship between
supervised and semi-supervised learning algorithms. In this
setting, the unlabeled samples are incorporated into the cost
function of the optimization problem (13) as

$$\text{minimize} \quad \|w\|_2^2 + C_1 \sum_{i=1}^{M_{Tr}} L_{Tr}(s_i, y_i) + C_2 \sum_{i=1}^{M_{Te}} L_{Te}(s'_i), \tag{19}$$

where $C_1$ and $C_2$ are confidence parameters, and $L_{Tr}(s_i, y_i) =
\max(0, 1 - y_i (w \cdot s_i + b))$ and $L_{Te}(s'_i) = \max(0, 1 - \|s'_i\|_1)$ are
the loss functions of the training and test samples, respectively.

The main assumption of the S3VM is that the samples in
the same cluster have the same labels and the number of sub-clusters
is not large. In other words, attacked and secure
measurement vectors should be clustered in distinct regions
in the feature spaces. Moreover, the difference between the
number of attacked and secure measurements should not be
large in order to avoid the formation of sub-clusters.

This requirement can be validated by analyzing the feature
space. Following (6), if $\|z_i - z_j\|_2 + \|a_i - a_j\|_2 < \|a_i\|_2 + \|a_j\|_2$
and $\|z_k - z_l\|_2 < \|a_i\|_2 + \|a_j\|_2$, $\forall i, j \in \mathcal{A}$ and $\forall k, l \in \bar{\mathcal{A}}$, then
the samples belonging to different classes are well separated.
Moreover, this requirement is satisfied in
(19) by adjusting $C_2$ [27]. A survey of the methods which are
used to provide optimal $C_2$ and solve (19) is given in [27].


C. Decision and Feature Level Fusion Methods 


One of the challenges of statistical learning theory is to find 
a classification rule that performs better than a set of rules of 
individual classifiers, or to find a feature set that represents the 
samples better than a set of individual features. One approach 
to solve this problem is to combine a collection of classifiers 
or a set of features to boost the performance of the individual 
classifiers. The former approach is called decision level fusion 
or ensemble learning, and the latter approach is called feature 
level fusion. In this section, we consider Adaboost [28] and
Multiple Kernel Learning (MKL) [29] for ensemble learning
and feature level fusion.

1) Ensemble Learning for Decision Level Fusion: Various
methods such as bagging, boosting and stacking have been
developed to combine classifiers in ensemble learning situations.
In the following, Adaboost is explained as
an ensemble learning approach, in which a collection of weak
classifiers is generated and combined using a combination
rule to construct a stronger classifier which performs better
than the weak classifiers.

At each iteration $t = 1, 2, \ldots, T$ of the algorithm, a decision
or hypothesis $f_t(\cdot)$ of the weak classifier is computed with
respect to the distribution on the training samples $D_t(\cdot)$ at $t$ by
minimizing the weighted error $\epsilon_t = \sum_{i=1}^{M_{Tr}} D_t(i) I(f_t(s_i) \neq y_i)$,
where $I(\cdot)$ is the indicator function. The distribution is initialized
uniformly as $D_1(i) = \frac{1}{M_{Tr}}$ at $t = 1$, and is updated by a
parameter $\alpha_t = \frac{1}{2} \log\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)$ as follows

$$D_{t+1}(i) = \frac{D_t(i) \exp(-\alpha_t y_i f_t(s_i))}{Z_t}, \tag{20}$$

where $Z_t$ is a normalization parameter, called the partition
function. At the output of the algorithm, a strong classifier
$H(\cdot)$ is constructed for a sample $s'$ using
$H(s') = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t f_t(s')\right)$.
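The boosting loop can be sketched as follows, under stated assumptions: scalar samples, threshold classifiers ("stumps") as the weak learners, and a weak learner chosen at each round by exhaustive search over a fixed candidate list. These choices, and the names `stump`, `adaboost` and `strong_classifier`, are illustrative and are not the weak classifiers used in the experiments.

```python
import math


def stump(threshold, polarity):
    """Weak classifier on a scalar sample: `polarity` if s >= threshold, else -polarity."""
    return lambda s: polarity if s >= threshold else -polarity


def adaboost(samples, labels, candidates, T):
    M = len(samples)
    D = [1.0 / M] * M                       # D_1(i): uniform distribution
    ensemble = []                           # list of (alpha_t, f_t)
    for _ in range(T):
        def weighted_error(f):
            return sum(D[i] for i in range(M) if f(samples[i]) != labels[i])
        f_t = min(candidates, key=weighted_error)
        eps = weighted_error(f_t)
        if eps >= 0.5:                      # no better than chance: stop
            break
        eps = max(eps, 1e-12)               # keep alpha finite for a perfect stump
        alpha = 0.5 * math.log((1.0 - eps) / eps)
        # Eq. (20): reweight, then normalize by the partition function Z_t.
        D = [D[i] * math.exp(-alpha * labels[i] * f_t(samples[i])) for i in range(M)]
        Z = sum(D)
        D = [d / Z for d in D]
        ensemble.append((alpha, f_t))
    return ensemble


def strong_classifier(ensemble, s):
    """H(s) = sign(sum_t alpha_t f_t(s)), with sign(0) taken as +1."""
    return 1 if sum(alpha * f(s) for alpha, f in ensemble) >= 0 else -1


# Toy data that no single stump separates (illustrative values).
samples = [0, 1, 2, 3, 4, 5]
labels = [1, 1, -1, -1, 1, 1]
thresholds = [-0.5, 0.5, 1.5, 2.5, 3.5, 4.5, 5.5]
candidates = [stump(th, p) for th in thresholds for p in (1, -1)]
ensemble = adaboost(samples, labels, candidates, T=3)
predictions = [strong_classifier(ensemble, s) for s in samples]
```

On this toy set each stump alone misclassifies at least two samples, while the weighted combination of three stumps labels all six samples correctly.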

2) Multiple Kernel Learning for Feature Level Fusion: Feature
level fusion methods combine the feature spaces instead
of the decisions of the classifiers. One of the feature level
fusion methods is MKL, in which different feature mappings
are represented by kernels that are combined to produce a
new kernel which represents the samples better than the other
kernels [29]. Therefore, MKL provides an approach to solve
the feature mapping selection problem of SVM. In order to
see this relationship, we first give the dual form of (13):

$$\begin{aligned} \text{maximize} \quad & \sum_{i=1}^{M_{Tr}} \beta_i - \frac{1}{2} \sum_{i=1}^{M_{Tr}} \sum_{j=1}^{M_{Tr}} \beta_i \beta_j y_i y_j k(s_i, s_j) \\ \text{subject to} \quad & \sum_{i=1}^{M_{Tr}} \beta_i y_i = 0, \\ & 0 \leq \beta_i \leq C, \quad \forall i = 1, 2, \ldots, M_{Tr}, \end{aligned} \tag{21}$$

where $\beta_i$ is the dual variable and $k(s_i, s_j) = \Psi(s_i) \cdot \Psi(s_j)$ is
the kernel function. Therefore, (21) is a single kernel learning
algorithm which employs a single kernel matrix $K \in \mathbb{R}^{M_{Tr} \times M_{Tr}}$
with elements $K(i, j) = k(s_i, s_j)$. If we define the weighted
combination of $U$ kernels as $K = \sum_{u=1}^{U} d_u K_u$, where $d_u \geq 0$ are
the normalized weights such that $\sum_{u=1}^{U} d_u = 1$, then we obtain
the following optimization problem of the MKL:

$$\begin{aligned} \text{maximize} \quad & \sum_{i=1}^{M_{Tr}} \beta_i - \frac{1}{2} \sum_{i=1}^{M_{Tr}} \sum_{j=1}^{M_{Tr}} \beta_i \beta_j y_i y_j \sum_{u=1}^{U} d_u K_u(s_i, s_j) \\ \text{subject to} \quad & \sum_{i=1}^{M_{Tr}} \beta_i y_i = 0, \\ & 0 \leq \beta_i \leq C, \quad \forall i = 1, 2, \ldots, M_{Tr}. \end{aligned} \tag{22}$$

In (22), the kernels with $d_u = 0$ are eliminated, and therefore
MKL can be considered as a kernel selection method. In the
experiments, SVM algorithms are implemented with different
kernels and these kernels are combined under MKL.
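The weighted kernel combination that enters (22) can be sketched directly. The example assumes a linear kernel and a Gaussian kernel of the form $\exp(-\|s - t\|_2^2 / (2\sigma^2))$; this parameterization and the function names are illustrative assumptions, and the weights $d_u$ are fixed by hand rather than learned.

```python
import math


def linear_kernel(s, t):
    return sum(si * ti for si, ti in zip(s, t))


def gaussian_kernel(sigma):
    def k(s, t):
        sq_dist = sum((si - ti) ** 2 for si, ti in zip(s, t))
        return math.exp(-sq_dist / (2.0 * sigma ** 2))
    return k


def gram(kernel, samples):
    """Kernel matrix with entries K(i, j) = k(s_i, s_j)."""
    return [[kernel(s, t) for t in samples] for s in samples]


def combine(grams, d):
    """K = sum_u d_u K_u with d_u >= 0 and sum_u d_u = 1."""
    assert all(du >= 0 for du in d) and abs(sum(d) - 1.0) < 1e-12
    M = len(grams[0])
    return [[sum(du * K[i][j] for du, K in zip(d, grams)) for j in range(M)]
            for i in range(M)]


# Toy samples (illustrative values).
samples = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
K_lin = gram(linear_kernel, samples)
K_rbf = gram(gaussian_kernel(1.0), samples)

K_mix = combine([K_lin, K_rbf], [0.5, 0.5])   # both kernels contribute
K_sel = combine([K_lin, K_rbf], [0.0, 1.0])   # the linear kernel is eliminated
```

With $d = (0, 1)$ the combined matrix equals the Gaussian kernel matrix, which is the sense in which MKL acts as a kernel selection method.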


D. Online Learning Methods for Real-time Attack Detection 

In the smart grid, the measurements are observed in real time,
where the samples are collected sequentially in time.
In this scenario, we relax the distribution assumption of
Section II.B, since the samples are observed in an arbitrary
sequence [33]. Moreover, smart PMUs which employ learning
algorithms may be required to detect the attacks when the
measurements are observed without processing the whole set
of training samples. In order to solve these challenging problems,
we may use online versions of the learning algorithms
given in the previous sections.

In a general online learning setting, a sequence of training 
samples (or a single sample) is given to the learning algorithm 
at each observation or algorithm processing time. Then, the 
algorithm computes the learning model using only the given 
samples and predicts the labels. The learning model is updated 
with respect to the error of the algorithm which is computed 
using a loss function on the given samples. Therefore, the 
perceptron and Adaboost are convenient for online learning in
this setting. For instance, an online perceptron is implemented
by predicting the label $y_i$ of a single sample $s_i$ at each
time $t$, and updating the weight vector $w$ using $\Delta w$ for the
misclassified samples with $y_i \neq \mathrm{sign}(f(s_i))$ [34]. This simple
approach is applied for the development of online MKL [34]
and regression algorithms [35].
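A minimal sketch of such an online perceptron is given below. It assumes the standard perceptron step $\Delta w = \gamma (y_i - \hat{y}_i) s_i$ with a fixed learning rate $\gamma$, and a constant last feature that plays the role of a bias term; the stream contents and the function names are illustrative.

```python
def sign(v):
    """-1 if v < 0, +1 otherwise."""
    return -1 if v < 0 else 1


def online_perceptron(stream, dim, gamma=0.5):
    """Process (s_i, y_i) pairs one at a time and update w only on mistakes."""
    w = [0.0] * dim
    mistakes = 0
    for s, y in stream:
        y_hat = sign(sum(wj * sj for wj, sj in zip(w, s)))
        if y_hat != y:
            # Delta w = gamma * (y - y_hat) * s for the misclassified sample.
            w = [wj + gamma * (y - y_hat) * sj for wj, sj in zip(w, s)]
            mistakes += 1
    return w, mistakes


# Toy stream (illustrative): the last feature is a constant 1 acting as a bias.
batch = [((2.0, 1.0), 1), ((-1.5, 1.0), -1), ((1.0, 1.0), 1), ((-2.0, 1.0), -1)]
w, mistakes = online_perceptron(batch * 5, dim=2)
predictions = [sign(sum(wj * sj for wj, sj in zip(w, s))) for s, _ in batch]
```

Each sample is seen once as it arrives and then discarded, so the memory cost is that of the weight vector alone.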


E. Performance Analysis 

In smart grid networks, the major concern is not just the
detection of attacked variables, but also that of the secure
variables with high performance. In other words, we require
the algorithms to predict the samples with high precision and
recall performance in order to avoid false alarms. Therefore,
we measure the true positives (tp), the true negatives (tn), the
false positives (fp), and the false negatives (fn), which are
defined in Table I.

In addition, the learning abilities and memorization properties
of the algorithms are measured by Precision (Prec), Recall
(Rec) and Accuracy (Acc) values, which are defined as [13]

$$Prec = \frac{tp}{tp + fp}, \quad Rec = \frac{tp}{tp + fn}, \quad Acc = \frac{tp + tn}{tp + tn + fp + fn}. \tag{23}$$

TABLE I: Definitions of performance measures

                          Attacked    Secure
Classified as Attacked    tp          fp
Classified as Secure      fn          tn

Precision values give information about the prediction per¬ 
formance of the algorithms. On the other hand, Recall values 
measure the degree of attack retrieval. Finally, the total 
classification performance of the algorithms is measured by 
Accuracy. For instance, if Prec = 1, then none of the secure 
measurements is misclassified as attacked. If Rec = 1, then 
none of the attacked measurements is misclassified as secure. 
If Acc = 1, then each measurement classified as attacked is 
actually exposed to an attack, and each measurement classified 
as secure is actually a secure measurement. 
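The measures in (23) can be computed from predicted and true labels as in the following sketch, which assumes the label convention $+1$ for attacked and $-1$ for secure. Returning 0 when a denominator vanishes is a convention chosen here for illustration, and the function names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count tp, tn, fp, fn as defined in Table I (+1 = attacked, -1 = secure)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == -1 and p == -1)
    fp = sum(1 for t, p in pairs if t == -1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == -1)
    return tp, tn, fp, fn


def precision_recall_accuracy(tp, tn, fp, fn):
    prec = tp / (tp + fp) if tp + fp else 0.0   # nothing classified as attacked
    rec = tp / (tp + fn) if tp + fn else 0.0    # no attacked samples present
    acc = (tp + tn) / (tp + tn + fp + fn)
    return prec, rec, acc


# Toy labels (illustrative values).
y_true = [1, 1, 1, -1, -1, -1, -1, -1]
y_pred = [1, 1, -1, 1, -1, -1, -1, -1]
counts = confusion_counts(y_true, y_pred)
prec, rec, acc = precision_recall_accuracy(*counts)
```

In this toy case one false alarm lowers Precision and one missed attack lowers Recall, while Accuracy reflects both errors.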

IV. Experiments 

The classification algorithms are analyzed in IEEE 9-bus,
57-bus and 118-bus test systems in the experiments. The
measurement matrices $H$ of the systems are obtained from
the MATPOWER toolbox [36]. The operating points of the test
systems provided in the MATPOWER case files are used in
generating $z$. Training and test data are generated by repeating
this process 50 times for each simulated point and dataset.
In the experiments, we assume that the attacker has access
to $\kappa$ measurements which are randomly chosen to generate
a $\kappa$-sparse attack vector $a$ with Gaussian distributed nonzero
elements with the same mean and variance as the entries of $z$.
We assume that concept drift [37] and dataset
shift [38] do not occur. Therefore, we use $G = N$ in the
simulations following the results of Ozay et al.

We analyze the behavior of each algorithm on each system
for both observable and unobservable attacks by generating
attack vectors with different values of $\kappa / N \in [0, 1]$. More
precisely, if $\kappa \geq N - D + 1$, then attack vectors that are not
observable by SVE, i.e. unobservable attacks, are generated.
Otherwise, the generated attacks are observable.
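The construction of a $\kappa$-sparse attack vector with Gaussian nonzero entries can be sketched as follows. The measurement values, the seed and the function names are illustrative assumptions; the sketch does not use MATPOWER data, and it does not attempt the construction of attack vectors in the column space of $H$.

```python
import random
import statistics


def sparse_attack(z, kappa, rng):
    """Return (a, attacked): a kappa-sparse vector with Gaussian nonzero entries.

    The nonzero entries are drawn with the mean and (population) standard
    deviation of the entries of z; the attacked indices are chosen at random.
    """
    N = len(z)
    mu = statistics.mean(z)
    sigma = statistics.pstdev(z)
    attacked = set(rng.sample(range(N), kappa))
    a = [rng.gauss(mu, sigma) if i in attacked else 0.0 for i in range(N)]
    return a, attacked


def in_unobservable_regime(kappa, N, D):
    """Sparsity condition kappa >= N - D + 1 from the text."""
    return kappa >= N - D + 1


rng = random.Random(0)                      # fixed seed for repeatability
z = [0.1 * i for i in range(1, 21)]         # stand-in measurement vector, N = 20
a, attacked = sparse_attack(z, kappa=5, rng=rng)
labels = [1 if i in attacked else -1 for i in range(len(z))]
```

The labels follow (4): entries with a nonzero attack component are labeled $+1$, and all other entries are labeled $-1$.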

The LIBSVM [39] implementation is used for the SVM,
and the ADMM [24] implementation is used for Sparse
Logistic Regression (SLR). The $k$ values of the k-NN algorithm
are optimized by searching $k \in \{1, 2, \ldots, \sqrt{M_{Tr}}\}$ using leave-one-out
cross-validation, where $M_{Tr}$ is the number of training
samples. Both the linear and Gaussian kernels are used for the
implementation of SVM. A grid search method [39], [40]
is employed to search the parameters of the SVM in an interval
$\mathcal{I} = [\mathcal{I}_{min}, \mathcal{I}_{max}]$, where $\mathcal{I}_{min}$ and $\mathcal{I}_{max}$ are user defined
values. In order to follow linear paths in the search space, log
values of parameters are considered in the grid search method.
Keerthi and Lin [41] analyzed the asymptotic properties
of the SVM for $\mathcal{I} = [0, \infty)$. In the experiments, $\mathcal{I}_{min} = -10$ is
chosen to compute a lower limit $2^{-10}$ of the parameter values
following the theoretical results given in [39] and [41]. Since
the classification performance of the SVM does not change
for parameter values that are greater than a threshold [41], we
used $\mathcal{I}_{max} = 10$ as employed in prior experimental analyses.
Therefore, the kernel width parameter $\sigma$ of a Gaussian
kernel is searched in the interval $\log(\sigma) \in [-10, 10]$ and the
cost penalization parameter $C$ of the SVM is searched in the
interval $\log(C) \in [-10, 10]$. The regularization parameter of
the SLR is computed as

$$\lambda = \Omega \lambda_{max}, \tag{24}$$

where $\lambda_{max} = \|H^T z\|_\infty$ determines the critical value of $\lambda$ above
which the solution of the ADMM problem is $0$, and $\Omega$ is
searched for in the interval $\Omega \in [10^{-3}, 1]$ [24], [42]. An
optimal $\lambda$ is computed by analyzing the solution (or regularization)
path of the LASSO type optimization algorithms
using a given training dataset. As the sparsity of the systems
that generate datasets increases, lower values are calculated for
$\Omega$ [24]. The absolute and relative tolerances, which
determine values of upper bounds on the Euclidean norms
of primal and dual residuals, are chosen as $10^{-4}$ and $10^{-2}$,
respectively [24]. The penalty parameter is initially selected as
1 and dynamically updated at each iteration of the algorithm.
The maximum number of iterations is chosen as $10^4$.
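The selection of $k$ by leave-one-out cross-validation can be sketched as follows. The sketch assumes Euclidean distances, majority voting with ties resolved by the order in which labels are encountered, and the smallest $k$ among equally good candidates; the toy data and the function names are illustrative.

```python
import math
from collections import Counter


def knn_predict(train, query, k):
    """Majority vote among the k training samples nearest to `query`."""
    nearest = sorted(train, key=lambda sy: math.dist(sy[0], query))[:k]
    votes = Counter(y for _, y in nearest)
    return votes.most_common(1)[0][0]


def loo_accuracy(train, k):
    """Leave-one-out accuracy of k-NN on the training set."""
    hits = 0
    for i, (s, y) in enumerate(train):
        rest = train[:i] + train[i + 1:]
        hits += knn_predict(rest, s, k) == y
    return hits / len(train)


def select_k(train):
    """Search k in {1, ..., floor(sqrt(M_Tr))} for the best leave-one-out accuracy."""
    k_max = max(1, math.isqrt(len(train)))
    return max(range(1, k_max + 1), key=lambda k: loo_accuracy(train, k))


# Toy one-dimensional samples (illustrative): +1 = attacked, -1 = secure.
train = [((0.1,), -1), ((0.2,), -1), ((0.3,), -1), ((0.4,), -1),
         ((0.9,), 1), ((1.1,), 1), ((1.2,), 1), ((1.3,), 1), ((1.4,), 1)]
best_k = select_k(train)
label = knn_predict(train, (1.0,), 3)
```

With nine training samples the search covers $k \in \{1, 2, 3\}$, matching the upper limit $\sqrt{M_{Tr}}$ stated above.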

In the experiments, we observe that the selection of tolerance
parameters does not affect the convergence rates if their
relative values do not change. In addition, selection of the
initial value of the penalty parameter also does not affect the
convergence rate if relative values of tolerance parameters are
fixed [24]. For instance, similar convergence rates are observed
when we chose $10^{-4}$ and $10^{-2}$, or $10^{-6}$ and $10^{-4}$, as tolerance
parameters. In order to decide whether there is an attack using
the SVE, $\|z - H\hat{x}\|_2^2 \leq \tau$ is evaluated, assuming a chi-square
test with 95% confidence in the computation of $\tau$.

A. Results for Supervised Learning Algorithms 

The performance of different algorithms is compared for
the IEEE 57-bus system in Fig. 2. Accuracy values of the
SVE and perceptron increase as $\kappa / N$ increases in Fig. 2.a and
Fig. 2.b. Additionally, Recall values of the SVE increase
linearly as $\kappa / N$ increases. Precision values of the perceptron are
high and do not decrease, and Accuracy and Recall values
increase, since fn values decrease and tn values increase.
In Fig. 2.c, a phase transition around $\kappa^* = N - D + 1$ is
observed for the performance of the SVM. Since the distance
between measurement vectors of attacked and secure variables
increases as $\kappa / N$ increases following (6), we observe that the
Accuracy, Precision and Recall values of the k-NN increase
in Fig. 2.d. Accuracy and Recall values of the k-NN and SLR
are above 0.9 and do not change as $\kappa / N$ increases in Fig. 2.e.

The class-based performance values of the algorithms are
measured using class-wise performance indices, where Class-1
and Class-2 denote the classes of attacked and secure variables,
respectively. The class-wise performance indices are defined
as follows:

Class-1: $\text{Prec-1} = \frac{tp}{tp + fp}$, $\text{Rec-1} = \frac{tp}{tp + fn}$, (25)

Class-2: $\text{Prec-2} = \frac{tn}{tn + fn}$, $\text{Rec-2} = \frac{tn}{fp + tn}$. (26)
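As a small illustration, the class-wise indices can be computed from confusion counts. The sketch assumes Class-1 (attacked) is the positive class, so that Prec-1 = tp/(tp+fp), Rec-1 = tp/(tp+fn), Prec-2 = tn/(tn+fn) and Rec-2 = tn/(fp+tn). The function name and the example counts are hypothetical.

```python
def class_wise_metrics(tp, fp, tn, fn):
    """Class-wise precision and recall from confusion counts.

    Class-1 is the attacked (positive) class, Class-2 the secure class.
    A ratio with a zero denominator is reported as NaN.
    """
    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "Prec-1": ratio(tp, tp + fp),
        "Rec-1": ratio(tp, tp + fn),
        "Prec-2": ratio(tn, tn + fn),
        "Rec-2": ratio(tn, fp + tn),
    }


# Hypothetical counts: 8 attacks found, 2 false alarms, 5 attacks missed.
metrics = class_wise_metrics(tp=8, fp=2, tn=85, fn=5)
```

A detector that raises many false alarms shows up here as a low Prec-1 and a low Rec-2, while missed attacks lower Rec-1 and Prec-2.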
Fig. 2: Results for the IEEE 57-bus system (recoverable panel
titles: Linear SVM, k-NN, SLR). Accuracy values of
the SVE and perceptron increase while Precision values of the
k-NN and SLR increase as $\kappa/N$ increases. Both Accuracy and
Precision values of the SVM increase and phase transitions
occur.


In Fig. 3a, we observe that the Precision, Recall and
Accuracy values of the SVE increase as $\kappa/N$ increases for Class-1.
Note that the first value of Acc-1 is observed at 0.008.
In Fig. 3b, Precision values for Class-2 decrease with the
percentage of attacked variables, i.e., the number of secure
variables that are incorrectly classified by the SVE increases as
the number of attacked variables increases. Although the SVE
may correctly detect the attacked variables as $\kappa/N$ increases, the
secure variables are incorrectly labelled as attacked variables,
and therefore, the SVE gives more false alarms than the other
algorithms.

Performance values for the perceptron are given in Fig. 4.
We observe that Precision values for Class-1 increase and
Recall values do not change drastically for either class
as $\kappa/N$ increases. Moreover, we do not observe any performance
increase for the Recall values of the secure class in the
perceptron.



Fig. 3: Experiments using the SVE for the IEEE 57-bus test
system. (a) Performance values for Class-1. (b) Performance
values for Class-2. Note that fp values increase as $\kappa/N$ increases.

Fig. 4: Performance analysis of the perceptron.

In Fig. 5, the results for the k-NN are shown. We observe
that performance values for Class-1 increase and the values
for Class-2 decrease as $\kappa/N$ increases, since the k-NN is sensitive
to class-balance and sparsity of the data [?]. In addition,
classification hypotheses are computed by forming neighborhoods
in Euclidean spaces, and the $\ell_2$ norm of vectors
of attacked measurements increases as $\kappa/N$ increases in the
observation model; therefore, decision boundaries of the
hypotheses are biased towards Class-1.

Fig. 5: (a) Prec. values for the IEEE 57-bus. (b) Rec. values for
the IEEE 57-bus. (c) Prec. values for the IEEE 118-bus. (d) Rec.
values for the IEEE 118-bus. Since the k-NN is sensitive to
class-balance and sparsity of the data, performance values for
Class-1 increase and the values for Class-2 decrease as $\kappa/N$
increases. Note that the performance curves intersect at the
critical values $\kappa^*$.

Fig. 6 depicts the results for the SLR, where the performance
values for Class-2 (secure variables) increase as
the system size increases. Moreover, we observe that the
performance values for Class-2 do not decrease rapidly as
$\kappa/N$ increases, compared to the other supervised algorithms. In
addition, the performance values for Class-1 are higher than
the values of the other algorithms, especially for lower $\kappa/N$
values. The reason is that the SLR can handle the variety in the
sparsity of the data as $\kappa/N$ changes. This task is accomplished
by controlling and learning the sparsity of the solution in (18)
using the training data in order to learn the sparse structure of
the measurements defined in the observation model.

The results of the experiments for the SVM are shown in
Fig. 7, where a phase transition for the performance values is
observed. It is worth noting that the values of $\kappa$ at which the
phase transition occurs correspond to the minimum number
of measurement variables, $\kappa^*$, that the attacker needs to
compromise in order to construct unobservable attacks. The
critical point is depicted as a vertical dotted line in Fig. 7.
For instance, $\kappa^* = 10$ and $\kappa^*/N = 0.56$ for the IEEE 9-bus
test system. The transitions are observed before the critical
points when the linear kernel SVM is employed in the experiments
for the IEEE 57-bus and 118-bus systems. In addition, the phase
transitions of performance values occur at the critical points
when Gaussian kernels are used.
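As a consistency check on the IEEE 9-bus example, the reported numbers can be combined with the critical value $\kappa^* = N - D + 1$. The values $N = 18$ and $D = 9$ below are not stated in this section; they are inferred under the assumption that 0.56 is a rounding of $10/18$.

```latex
\[
\frac{\kappa^*}{N} \approx 0.56,\quad \kappa^* = 10
\;\Longrightarrow\; N \approx \frac{10}{0.56} \approx 17.9,
\ \text{so } N = 18 \text{ and } \frac{10}{18} \approx 0.556 .
\]
\[
\kappa^* = N - D + 1
\;\Longrightarrow\; D = N - \kappa^* + 1 = 18 - 10 + 1 = 9 .
\]
```

Under this reading, an attacker would need access to at least 10 of the 18 measurements to construct an unobservable attack on this system.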

B. Results for Semi-supervised Learning Algorithms 

We use the S3VM with default parameters as suggested
in [?]. The results of the semi-supervised SVM are shown
in Fig. 8. Unlike the supervised SVM, the semi-supervised SVM
does not exhibit sharp phase transitions, since the
information obtained from unlabeled data contributes to the
performance values in the computation of the learning models.
For instance, Precision values of Class-2 decrease sharply
near the critical point for the supervised SVM in Fig. 7.
However, the semi-supervised SVM employs the unlabeled
samples during the computation of the learning model in (19),
and partially solves this problem.



Fig. 6: Experiments using the SLR. (a), (b) Results for the IEEE
57-bus. (c), (d) Results for the IEEE 118-bus. Note that the SLR
can handle the variety in the sparsity of the data as $\kappa/N$ changes.


C. Results for Decision and Feature Level Fusion Algorithms 

In this section, we analyze Adaboost and MKL. Decision
stumps are used as weak classifiers in Adaboost [?]. Each
decision stump is a single-level two-leaf binary decision tree
which is used to construct a set of dichotomies consisting
of binary labelings of samples [?]. The number of weak
classifiers is selected using leave-one-out cross-validation in
the training set. We use MKL with a linear and a Gaussian
kernel with the default parameters suggested in the SimpleMKL
implementation [32]. The results given in Fig. 9 show
that Recall values of MKL for Class-1 are less than the
values of Adaboost. In addition, Precision values of MKL
decrease faster than the values of Adaboost as $\kappa/N$ increases
for Class-2. Therefore, the fn values of MKL are greater than
the values of Adaboost, or in other words, the number of
attacked measurements misclassified as secure by MKL is
greater than that of Adaboost. This phenomenon is also observed
in the results for the semi-supervised and supervised SVM given
in the previous sections. However, unlike the supervised SVM,
MKL does not exhibit phase transitions of the performance
values.
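To make the role of decision stumps concrete, here is a minimal sketch of Adaboost with single-feature threshold stumps. It is the textbook discrete Adaboost, not the authors' implementation. The number of rounds is fixed by hand here rather than selected by leave-one-out cross-validation, and the function names and toy data are assumptions.

```python
import math


def fit_stump(X, y, w):
    """Return (error, feature, threshold, polarity) with least weighted error.

    The stump predicts `polarity` when x[feature] > threshold, else -polarity.
    """
    best = None
    for j in range(len(X[0])):
        values = sorted({row[j] for row in X})
        # Candidate cuts: below the minimum and midway between neighbours.
        cuts = [values[0] - 1.0]
        cuts += [(a + b) / 2.0 for a, b in zip(values, values[1:])]
        for thr in cuts:
            for pol in (1, -1):
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if (pol if row[j] > thr else -pol) != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, pol)
    return best


def stump_predict(stump, row):
    _, j, thr, pol = stump
    return pol if row[j] > thr else -pol


def adaboost(X, y, rounds):
    """Train `rounds` weighted stumps; labels y must be in {-1, +1}."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        stump = fit_stump(X, y, w)
        err = min(max(stump[0], 1e-12), 1.0 - 1e-12)  # avoid log(0)
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, stump))
        # Increase the weight of misclassified samples, then renormalize.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for row, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def predict(ensemble, row):
    score = sum(alpha * stump_predict(s, row) for alpha, s in ensemble)
    return 1 if score > 0 else -1


# Toy 1-D data: no single threshold separates the two classes.
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = [1, 1, -1, -1, 1, 1]
one_stump = adaboost(X, y, rounds=1)
three_stumps = adaboost(X, y, rounds=3)
```

On this toy set a single stump misclassifies two samples, while the weighted vote of three stumps labels every training sample correctly, which is the sense in which decision-level fusion combines weak classifiers.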

D. Results for Online Learning Algorithms 

We consider four online learning algorithms, namely Online
Perceptron (OP), Online Perceptron with Weighted Models
(OPWM), Online SVM and Online SLR. Note that these algorithms
are the online versions of the batch learning algorithms
given in Section III-A and developed considering the online
algorithm design approach given in Section III-D. The details
of the implementations of the OP, OPWM, Online SVM and
Online SLR are given in [34], [35] and [45].
Fig. 7: Experiments using the SVM with linear and Gaussian kernels.
Panels (a)-(d): IEEE 9-bus; (e)-(h): IEEE 57-bus; (i)-(l): IEEE
118-bus. For each system, the first two panels show the linear SVM
and the last two the Gaussian SVM. Horizontal axis: $\kappa/N$; the
critical point is marked in each panel. Phase transitions of
performance values occur at the critical point $\kappa^*$. See the text
for more detailed explanation.


When the OP is used, only the model $w(t)$ computed using
the last observed measurement at time $t$ is considered for the
classification of the test samples. On the other hand, in the
OPWM we consider an average of the models,
$w_{\mathrm{ave}} = \sum_{t=1}^{T} w(t)$, which is
computed by minimizing margin errors. Results
are given for the OP in Fig. 10. In the weighted models, we
observe phase transitions of the performance values for Class-2
in Fig. 10e-Fig. 10h. However, the phase transitions occur
before the critical values, and the values of the phase transition
points decrease as the system size increases. Additionally, we
do not observe sharp phase transitions in the OP.
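The difference between using the last model and using an aggregate of all models can be sketched with a plain mistake-driven perceptron. This is only an illustration. It omits the margin-error minimization and the linear-independence check that the OPWM uses, and the function names and the toy stream are assumptions.

```python
def online_perceptron(samples):
    """Single pass over (x, y) pairs with labels y in {-1, +1}.

    Returns the last model w(T) and the sum of all intermediate models
    w(1) + ... + w(T); the sum gives the same labels as the average.
    """
    dim = len(samples[0][0])
    w = [0.0] * dim
    w_sum = [0.0] * dim
    for x, y in samples:
        score = sum(wi * xi for wi, xi in zip(w, x))
        pred = 1 if score > 0 else -1
        if pred != y:  # mistake-driven update
            w = [wi + y * xi for wi, xi in zip(w, x)]
        w_sum = [si + wi for si, wi in zip(w_sum, w)]
    return w, w_sum


def classify(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1


# Toy stream of three labelled measurements (hypothetical values).
stream = [((1.0, 1.0), 1), ((-1.0, -1.0), -1), ((2.0, 1.0), 1)]
w_last, w_agg = online_perceptron(stream)
```

Test samples would be labelled with `classify(w_last, x)` for the last-model variant and `classify(w_agg, x)` for the aggregated variant. Aggregation tends to damp the effect of the most recent update, which is one reason the two variants can behave differently.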

In the OP, if a measurement $s$ is not correctly
labeled, then the measurement vector is added to a set of
supporting measurements $\mathbb{S}$ that are used to update the
hypotheses in the training process. In the OPWM, however, the
hypotheses are updated only if a measurement $s'$ is not correctly
labeled and the vectors of $s'$ and $s \in \mathbb{S}$ are linearly
independent. Since the smallest number of linearly dependent
measurements increases as $\kappa/N$ increases [?], [?], the size of
$\mathbb{S}$ decreases and the bias is decreased towards Class-1.
Therefore, false negative (fn) values decrease and false positive
(fp) values increase [?]. As a result, we observe that Recall values
of the OP are less than those of the OPWM for Class-1. The results
of the Online SVM and Online SLR are provided in Fig. 10 for
different IEEE test systems. We observe phase transitions of
performance values in the Online SVM similar to the batch
supervised SVM.

Learning curves of online learning algorithms are given in
Fig. 11 for both observable attacks generated with $\kappa/N = 0.33$
and unobservable attacks generated with $\kappa/N = 0.66$. Since the
cost function of each online learning algorithm is different,
the learning performance is measured and depicted using
accuracy (Acc) defined in (23). In the results, performance
values of the Online SVM and OPWM increase as the number
of samples increases, since the algorithms employ margin
learning approaches which provide better learning rates as the
number of training samples increases [34], [?].

Briefly, we suggest using Online SLR for the scenarios in 
which the precision of the classification of secure variables 
is important to avoid false alarms. On the other hand, if the 
classification of attacked variables with high Precision and 
Recall values is an important task, we suggest using the Online 
Perceptron. 
Fig. 8: Semi-supervised SVM with linear kernel (upper panels) and
Gaussian kernel (lower panels). Sharp phase transitions are not
observed in the semi-supervised SVM unlike the supervised SVM, since
the information obtained from unlabeled data contributes to the
performance values in the computation of the learning models.


Fig. 9: Experiments using Adaboost and MKL (horizontal axis: $\kappa/N$).
Upper panels: Adaboost. (e), (f) MKL for the IEEE 57-bus. (g), (h)
MKL for the IEEE 118-bus. Note that the fn values of MKL are greater
than the values of Adaboost, and there are no phase transitions of
the performance values of MKL compared to the supervised SVM.
Fig. 10: Experiments using the Online Perceptron (OP), Online
Perceptron with Weighted Models (OPWM), Online SVM and SLR. Panel
rows, top to bottom: Online Perceptron; Online Perceptron with
Weighted Models; Online SVM ((i), (j) IEEE 57-bus; (k), (l) IEEE
118-bus); Online Sparse Logistic Regression ((m), (n) IEEE 57-bus;
(o), (p) IEEE 118-bus). Recall values of the OP are less than those
of the OPWM for Class-1. Multiple phase transitions of performance
values of the Online SVM are observed in the IEEE 118-bus system.


V. Summary and Conclusion 

The attack detection problem has been reformulated as a
machine learning problem, and the performance of supervised,
semi-supervised, classifier and feature space fusion, and online
learning algorithms has been analyzed for different attack
scenarios.

In a supervised binary classification problem, the attacked
and secure measurements are labeled in two separate classes.
In the experiments, we have observed that state-of-the-art
machine learning algorithms perform better than the well-known
attack detection algorithms which employ a state vector
estimation approach for the detection of both observable and
unobservable attacks.

Fig. 11: Learning curves of online learning algorithms. Panel
titles: attacks generated with $\kappa/N = 0.32895$ and attacks
generated with $\kappa/N = 0.65789$.

We have observed that the perceptron is less sensitive and
the k-NN is more sensitive to the system size than the other
algorithms. In addition, the imbalanced data problem affects
the performance of the k-NN. Therefore, the k-NN may perform
better in small sized systems and worse in large sized systems
when compared to other algorithms. The SVM performs
better than the other algorithms in large-scale systems. In the
performance tests of the SVM, we observe a phase transition
at $\kappa^*$, which is the minimum number of measurements that are
required to be accessible by the attackers in order to construct
unobservable attacks. Moreover, a large value of $\kappa$ does not
necessarily imply high impact of data injection attacks. For
example, if the attack vector $a$ has small values in all elements,
then the impact of $a$ may still be limited. More importantly, if
$a$ is a vector with small values compared to the noise, then
even machine learning-based approaches may fail.

We observe two challenges of SVMs in their application to
attack detection problems in the smart grid. First, the performance
of the SVM is affected by the selection of kernel types. For
instance, we observe that the linear and Gaussian kernel SVMs
perform similarly in the IEEE 9-bus system. However, for the
IEEE 57-bus system the Gaussian kernel SVM outperforms
its linear counterpart. Moreover, the values of the phase
transition points of the performance of the Gaussian kernel
SVM coincide with the theoretically computed $\kappa^*$ values. This
implies that the feature vectors computed using Gaussian
kernels are linearly separable for higher values of $\kappa$.
Interestingly, the transition points miss $\kappa^*$ in the IEEE
118-bus system, which means that alternative kernels are
needed for this system. Second, the SVM is sensitive to
the sparsity of the systems. In order to solve this problem,
sparse SVM [48] and kernel machines [?] can be employed.
In this paper, we approached this problem using the SLR.
However, obtaining an optimal regularization parameter, $\lambda$, is
computationally challenging [24].

In order to use information extracted from test data in the
computation of the learning models, semi-supervised methods
have been employed in the proposed approach. In semi-supervised
learning algorithms, we have used test data together
with training data in an optimization algorithm used
to compute the learning model. The numerical results show
that the semi-supervised learning methods are more robust to
the degree of sparsity of the data than the supervised learning
methods.

We have employed Adaboost and MKL as decision and
feature level fusion algorithms. Experimental results show that
fusion methods provide learning models that are more robust
to changes in the system size and data sparsity than the other
methods. On the other hand, the computational complexities of
most of the classifier and feature fusion methods are higher
than those of the single classifier and feature extraction methods.

Finally, we have analyzed online learning methods for real-time
attack detection problems. Since a sequence of training
samples or just a single sample is processed at each time, the
computational complexity of most of the online algorithms is
less than that of the batch learning algorithms. In the
experiments, we have observed that the classification performance
of online learning algorithms is comparable to that of the batch
algorithms.

In future work, we plan to first apply the proposed approach
and the methods to an attack classification problem for deciding
which of several possible attack types has occurred given
that an attack has been detected. Then, we plan to consider
the relationship between measurement noise and bias-variance
properties of learning models for the development of attack
detection and classification algorithms. Additionally, we plan
to expand our analyses for varying number of clusters $G$ and
cluster sizes $N_g$, $\forall g = 1, 2, \ldots, G$, by relaxing the
assumptions made in this work for attack detection in smart grid
systems, e.g., when the samples are not independent and identically
distributed and are obtained from non-stationary distributions, in
other words, when concept drift [?] and dataset shift [38] occur.